CPSC 330 Lecture 19: Time series

Varada Kolhatkar

Focus on the breath!

Announcements

  • HW8 has been released (due next week Monday)
    • Almost there! You’ve got this! 😊
  • Midterm 2 grading is in progress.

Recap: iClicker questions

    1. In multinomial logistic regression, the model learns a separate weight vector and bias for each class.
    2. Neural networks are powerful models, so it’s usually a good idea to start with them on any new machine learning problem.
    3. The main reason we add hidden layers is to allow the model to learn increasingly complex representations.
    4. Convolutional neural networks (CNNs) use filters that slide over the image to detect local patterns.
    5. Using a pre-trained network as a feature extractor typically requires less data than training a deep network from scratch.

Today’s lecture goals

  • What is a time series?
  • How do we know a problem is a time series problem?
  • Why do standard ML models struggle with time-dependent data?
  • How can we adapt ML models to handle time series?

What type of model would be appropriate?

| Scenario | Model/Method |
| --- | --- |
| You have user–item ratings (e.g., movie ratings) and want to predict missing ratings. | ? |
| You have a collection of documents without any labels and want to group them into themes. | ? |
| You want to classify the emotion of a set of text messages, but you do not have any labeled data. | ? |
| You have a small dataset with ~500 images containing pictures and names of 20 different Computer Science faculty members from UBC. Your goal is to develop a reasonably accurate multi-class classification model for this task. | ? |

Loan default prediction (tabular data)

You work for a financial institution and have a dataset where each row represents a customer applying for a loan. What type of model would you use?

| customer_id | income_k | credit_utilization | late_payments | employment_length | employment_type | home_ownership | loan_purpose | default |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 95 | 22 | 0 | 9 | salaried | mortgage | home_improvement | 0 |
| 2 | 45 | 78 | 3 | 2 | contract | rent | debt_consolidation | 1 |
| 3 | 120 | 30 | 1 | 7 | salaried | own | car | 0 |
| 4 | 60 | 65 | 2 | 3 | self_employed | rent | debt_consolidation | 1 |
| 5 | 85 | 40 | 0 | 10 | salaried | mortgage | education | 0 |
| 6 | 55 | 90 | 4 | 1 | contract | rent | debt_consolidation | 1 |
| 7 | 130 | 28 | 0 | 6 | salaried | own | car | 0 |
| 8 | 40 | 82 | 2 | 1 | self_employed | rent | debt_consolidation | 1 |
  • Rows are independent \(\rightarrow\) order does not matter \(\rightarrow\) time does not matter

citibike dataset

  • You have bike rental counts every three hours for one station in New York City over a month. You want to predict demand for the next three-hour period.
starttime
2015-08-01 00:00:00     3
2015-08-01 03:00:00     0
2015-08-01 06:00:00     9
2015-08-01 09:00:00    41
2015-08-01 12:00:00    39
2015-08-01 15:00:00    27
2015-08-01 18:00:00    12
2015-08-01 21:00:00     4
2015-08-02 00:00:00     3
2015-08-02 03:00:00     4
2015-08-02 06:00:00     6
2015-08-02 09:00:00    30
2015-08-02 12:00:00    46
2015-08-02 15:00:00    27
2015-08-02 18:00:00    28
2015-08-02 21:00:00     6
2015-08-03 00:00:00     3
2015-08-03 03:00:00     2
2015-08-03 06:00:00    21
2015-08-03 09:00:00     9
Freq: 3h, Name: one, dtype: int64

citibike data

(The same three-hourly rental counts shown above.)
  • Only feature: datetime (e.g., 2015-08-01 00:00:00)
  • The data is collected at regular intervals (every three hours)
  • Target: rentals in the next 3-hour period (e.g., 9 rentals between 2015-08-01 06:00:00 and 2015-08-01 09:00:00)
  • Goal: Given past rental counts, predict the number of rentals at a specific future time.

Using only the tools in your current toolbox, what model would you choose, and what challenges might you run into?

Why different treatment?

(The same three-hourly rental counts shown above.)
  • This type of data is distinctive because it is inherently sequential, with an intrinsic order based on time.
  • The number of bikes available at a station at one point in time is often related to the number of bikes at earlier times.
  • This is a time-series forecasting problem.

Models for time series

The ML models we’ve used so far do not have a built-in concept of time. There are two broad strategies for modeling time series:

  • Use models designed for sequential data, which explicitly capture temporal dependencies (e.g., Hidden Markov Models, Recurrent Neural Networks, transformer architectures)
  • Use tabular ML models with engineered temporal features (e.g., linear models, random forests, gradient boosted trees)
    • Requires creating features that capture temporal information (see the sketch after this list).
    • This allows us to reuse the familiar models in our toolbox.
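A minimal sketch of the second strategy, turning a single timestamp into tabular features (illustrative only; the lecture builds such features step by step below):

import pandas as pd

ts = pd.Timestamp("2015-08-01 09:00:00")
# A standard tabular model can consume engineered columns like these:
temporal_features = {"month": ts.month, "dayofweek": ts.dayofweek, "hour": ts.hour}
print(temporal_features)  # {'month': 8, 'dayofweek': 5, 'hour': 9}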

citibike data visualization

Start date: 2015-08-01 00:00:00
End date: 2015-08-31 21:00:00

  • Do you see any daily patterns? Weekly patterns? Noise?

⛔️ Incorrect data splitting

from sklearn.model_selection import train_test_split

# A random split shuffles away the temporal order
train_df, test_df = train_test_split(citibike, test_size=0.2, random_state=123)
print('Train largest date: ', train_df.index.max())
print('Test smallest date: ', test_df.index.min())
Train largest date:  2015-08-31 21:00:00
Test smallest date:  2015-08-01 12:00:00

⛔️ We should never train on the future to predict the past!

✅ Correct data splitting

In time series, the simplest split is:

  • earlier data \(\rightarrow\) training and later data \(\rightarrow\) testing
# Example split: the first 184 three-hour intervals (23 days) for training,
# the rest of the month for testing
n_train = 184
train_df = citibike[:n_train]
test_df = citibike[n_train:]

Feature engineering for time series

Motivation

  • In this toy data, we have just a single feature: the datetime.
  • Note that ML models do not have a built-in concept of time; we have to give it to them.
  • We will explore different ways to extract informative features from time.

POSIX time feature

  • Let’s start with our worst but simplest encoding.
  • A common way dates are stored on computers is POSIX (Unix) time: the number of seconds since 00:00:00 on January 1, 1970.
  • Let’s encode each timestamp as a single integer representing its POSIX time.
# nanoseconds since the epoch // 10**9 = seconds since the epoch (POSIX time)
X = citibike.index.astype("int64").values.reshape(-1, 1) // 10**9
y = citibike.values
X[:10]
array([[1438387200],
       [1438398000],
       [1438408800],
       [1438419600],
       [1438430400],
       [1438441200],
       [1438452000],
       [1438462800],
       [1438473600],
       [1438484400]])

Random forest on the POSIX time feature

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
eval_on_features(X, y, regressor, xticks, feat_names="POSIX time")
Train-set R^2: 0.85
Test-set R^2: -0.04

  • The predictions on the training data and the training score are pretty good.
  • But for the test data, a constant line is predicted…
  • What’s going on?

Trees cannot extrapolate!

  • Tree-based models (decision trees, random forests, gradient boosted trees) only make predictions within the range of values they’ve seen during training.
  • They are excellent interpolators but terrible extrapolators because
    • Trees partition the feature space into fixed regions, and predictions inside each region are averages of training labels.
    • If future timestamps are larger than any timestamp in the training set, trees cannot “see beyond” the training range, and they will flatline or behave unpredictably.

This is exactly what happens with POSIX time encoded as a single numeric feature!
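A minimal sketch of this failure mode, on synthetic data rather than the citibike data:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# A forest trained on "times" 0..99 with a simple upward trend
X_toy = np.arange(100).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel()
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Inside the training range the fit is fine; beyond it, predictions
# flatline near the value at the training boundary.
print(rf.predict([[50], [99], [150], [1000]]))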

Extracting date and time information

  • Note that our index is of this special type: DatetimeIndex. We can extract all kinds of interesting information from it.
print(citibike.index[0])
print(citibike.index[0].month_name())
print(citibike.index[0].dayofweek)
print(citibike.index[0].hour)
2015-08-01 00:00:00
August
5
0

Time of the day

  • We noted before that the time of the day and day of the week seem quite important.
  • Let’s start with time of the day.
X_hour = citibike.index.hour.values.reshape(-1, 1)
X_hour[:10]
array([[ 0],
       [ 3],
       [ 6],
       [ 9],
       [12],
       [15],
       [18],
       [21],
       [ 0],
       [ 3]], dtype=int32)

Random forest with time of the day

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
eval_on_features(X_hour, y, regressor, xticks, feat_names="Hour of the day")
Train-set R^2: 0.50
Test-set R^2: 0.60

The scores are better with the time-of-day feature!

Time of the day + Day of the week

Now let’s add day of the week along with time of the day.

X_hour_week = np.hstack(
    [
        citibike.index.dayofweek.values.reshape(-1, 1),
        citibike.index.hour.values.reshape(-1, 1),
    ]
)
X_hour_week[:16]
array([[ 5,  0],
       [ 5,  3],
       [ 5,  6],
       [ 5,  9],
       [ 5, 12],
       [ 5, 15],
       [ 5, 18],
       [ 5, 21],
       [ 6,  0],
       [ 6,  3],
       [ 6,  6],
       [ 6,  9],
       [ 6, 12],
       [ 6, 15],
       [ 6, 18],
       [ 6, 21]], dtype=int32)

Random forest with time of the day + day of the week

eval_on_features(X_hour_week, y, regressor, xticks, feat_names = "hour of day + day of week")
Train-set R^2: 0.89
Test-set R^2: 0.84

The time of the day and day of the week features are clearly helping.

Linear model

Let’s try an interpretable linear model, Ridge, with these features.
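The call presumably mirrors the one-hot version later in this lecture (a sketch; eval_on_features and xticks are the lecture’s helpers, and Ridge is sklearn.linear_model.Ridge):

eval_on_features(X_hour_week, y, Ridge(), xticks, feat_names="hour of day + day of week")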

Train-set R^2: 0.16
Test-set R^2: 0.13

  • Why is Ridge performing poorly on both the training and the test data?

Encoding time and day with OHE

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
X_hour_week_onehot = enc.fit_transform(X_hour_week).toarray()
hour = ["%02d:00" % i for i in range(0, 24, 3)]
day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
features = day + hour
pd.DataFrame(X_hour_week_onehot, columns=features).head(6)
Mon Tue Wed Thu Fri Sat Sun 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00
0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

Linear model with OHE day and time

Let’s try an interpretable linear model, Ridge, with these features.

eval_on_features(X_hour_week_onehot, y, Ridge(), xticks, feat_names="hour of day OHE + day of week OHE")
Train-set R^2: 0.53
Test-set R^2: 0.62

  • The scores are a bit better!
  • Can we improve them further?

Add interaction features

(Only a subset of the columns is shown; interaction columns such as “Tue 00:00” are products of a day-of-week column and an hour-of-day column. A sketch of creating them follows the table.)

Mon Tue Wed Thu Fri Sat Tue 00:00 Tue 03:00 Tue 06:00 Tue 09:00 Tue 18:00
0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
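A sketch of one way to create such interaction features, assuming scikit-learn’s PolynomialFeatures (the lecture’s exact call may differ):

from sklearn.preprocessing import PolynomialFeatures

# Degree-2, interaction-only products of the one-hot columns: each new
# column is 1 only for one specific (day of week, hour of day) combination
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_hour_week_onehot_poly = poly.fit_transform(X_hour_week_onehot)
features_poly = poly.get_feature_names_out(features)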

Linear model with OHE day and time + interaction feats

Train-set R^2: 0.87
Test-set R^2: 0.85

The scores are much better now!

Interpretation

Since we are using a linear model, we can examine the coefficients learned by Ridge.

Coefficient
Sat 09:00 15.196739
Wed 06:00 15.005809
Sat 12:00 13.437684
Sun 12:00 13.362009
Thu 06:00 10.907595
... ...
Sat 21:00 -6.085150
00:00 -11.693898
03:00 -12.111220
Sat 06:00 -13.757591
Sun 06:00 -18.033267

71 rows × 1 columns

Do these coefficients make sense?

Interim summary

  • Success in time-series analysis relies heavily on the appropriate choice of models and features.
  • Tree-based models cannot extrapolate; be cautious when using them with features that keep growing past the training range (e.g., raw timestamps).
  • Linear models struggle with cyclic patterns in numeric features (e.g., a numerically encoded time of the day) because these patterns are inherently non-linear.
  • One-hot encoding such features gives each time-of-day and day-of-week value its own coefficient, so a linear model can capture and use these cyclic patterns.

Lag-based features

n_rentals
starttime
2015-08-01 00:00:00 3
2015-08-01 03:00:00 0
2015-08-01 06:00:00 9
2015-08-01 09:00:00 41
2015-08-01 12:00:00 39
2015-08-01 15:00:00 27
2015-08-01 18:00:00 12
2015-08-01 21:00:00 4
2015-08-02 00:00:00 3
2015-08-02 03:00:00 4
  • In time series data there is temporal dependence: observations close in time tend to be correlated.
  • Currently we’re using only the current time to predict the number of bike rentals in the next three hours.
  • But what if the number of bike rentals is also related to the rentals three hours ago, six hours ago, and so on?

Such features are called lagged features.

Creating lag features

def create_lag_df(df, lag, cols):
    # For each column in cols, add lag columns col-1 ... col-lag,
    # where col-n holds the value from n time steps earlier
    return df.assign(
        **{f"{col}-{n}": df[col].shift(n) for n in range(1, lag + 1) for col in cols}
    )
rentals_lag5 = create_lag_df(rentals_df, 5, ['n_rentals'] )
rentals_lag5.head(8)
n_rentals n_rentals-1 n_rentals-2 n_rentals-3 n_rentals-4 n_rentals-5
starttime
2015-08-01 00:00:00 3 NaN NaN NaN NaN NaN
2015-08-01 03:00:00 0 3.0 NaN NaN NaN NaN
2015-08-01 06:00:00 9 0.0 3.0 NaN NaN NaN
2015-08-01 09:00:00 41 9.0 0.0 3.0 NaN NaN
2015-08-01 12:00:00 39 41.0 9.0 0.0 3.0 NaN
2015-08-01 15:00:00 27 39.0 41.0 9.0 0.0 3.0
2015-08-01 18:00:00 12 27.0 39.0 41.0 9.0 0.0
2015-08-01 21:00:00 4 12.0 27.0 39.0 41.0 9.0

Linear model with lag features

Train-set R^2: 0.25
Test-set R^2: 0.37
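One detail the scores hide: the first rows contain NaN lags and must be dropped before fitting. A sketch under that assumption (variable names hypothetical; the lecture’s helper may handle this differently):

from sklearn.linear_model import Ridge

# Drop the rows whose lag columns are NaN (the first `lag` rows)
lagged = rentals_lag5.dropna()
X_lag = lagged.drop(columns=["n_rentals"])
y_lag = lagged["n_rentals"]

# Split by time as before: earlier rows for training, later rows for testing
n_train = 184
lr = Ridge().fit(X_lag[:n_train], y_lag[:n_train])
print("Test-set R^2:", lr.score(X_lag[n_train:], y_lag[n_train:]))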

Random Forest with lag features

Train-set R^2: 0.94
Test-set R^2: 0.69

Random Forest with time and day + lag features

Train-set R^2: 0.95
Test-set R^2: 0.78

Cross-validation with time series

  • We can’t use regular cross-validation here: some folds would train on the future to predict the past.

  • There is TimeSeriesSplit for time series data.
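A minimal sketch on a toy series of six points, which produces the split pattern shown below (training indices on the left, test indices on the right):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Each successive fold trains on a longer prefix and tests on the next point
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(np.arange(6)):
    print(train_index, test_index)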

[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]

Forecasting further into the future

Problem

citibike.index.max()
Timestamp('2015-08-31 21:00:00')
  • So far, our lag features let us predict only 3 hours ahead.
  • What if we want to predict 15 hours into the future, i.e., the number of rentals for the three-hour period starting at 2015-09-01 12:00:00?
  • Problem: We do not yet know the rental counts for the required lag timestamps:
    • 2015-09-01 00:00:00
    • 2015-09-01 03:00:00
    • 2015-09-01 06:00:00
    • 2015-09-01 09:00:00
  • Without those values, our lag features break 😢

Approach 1: Iterative forecasting

  • Train one model that predicts 3 hours ahead.
  • At prediction time, move forward step by step, using your own predictions as future lag inputs.

Example:

  • Predict rentals at 00:00 on 2015-09-01.
  • Use that prediction as the lag to predict rentals at 03:00.
  • Use both predictions to predict rentals at 06:00.
  • Continue until you reach 12:00.

This method works, but errors accumulate as we step forward. The longer the horizon, the more uncertainty grows.
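A sketch of the iterative loop (hypothetical names; assumes `model` was trained on the five lag features above, ordered most recent first):

import numpy as np

# Start from the five most recent observed counts, most recent first
history = list(rentals_df["n_rentals"].iloc[-5:][::-1])

preds = []
for step in range(5):  # forecast 5 steps (15 hours) ahead
    x_next = np.array(history[:5]).reshape(1, -1)  # [lag-1, ..., lag-5]
    y_hat = model.predict(x_next)[0]
    preds.append(y_hat)
    history.insert(0, y_hat)  # the prediction becomes the newest lag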

Approach 2: Direct forecasting (multiple horizons)

  • Train separate models for each horizon:
    • Model 1 \(\rightarrow\) predict 3 hours ahead
    • Model 2 \(\rightarrow\) predict 6 hours ahead
    • Model 3 \(\rightarrow\) predict 9 hours ahead

Each model uses lag features that match the required horizon.
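A sketch of the direct setup (hypothetical; shift(-(h - 1)) aligns each row with the target h steps ahead of the most recent observed lag):

from sklearn.linear_model import Ridge

# One model per horizon, all trained on the same lag features
models = {}
for h in [1, 2, 3]:  # 3, 6, and 9 hours ahead
    data_h = rentals_lag5.assign(
        target=rentals_lag5["n_rentals"].shift(-(h - 1))
    ).dropna()
    X_h = data_h.drop(columns=["n_rentals", "target"])
    models[h] = Ridge().fit(X_h, data_h["target"])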

(Optional) Approach 3: Multi-output models (joint forecasting)

  • Train one model that predicts several future steps at once, e.g.:

    y = [rentals_in_3h, rentals_in_6h, rentals_in_9h, ...]

  • These models learn relationships across future time steps.

  • Note: Multi-output forecasting is powerful, but outside the scope of CPSC 330.

  • To help a model capture long-term trends, one idea is to create a feature such as “Days_since”: the number of days elapsed since the start of the series (illustrated below on a monthly sales dataset).
date sales sales-1 sales-2 sales-3 sales-4 sales-5 Days_since
0 1992-01-01 6938 NaN NaN NaN NaN NaN 0
1 1992-02-01 7524 6938.0 NaN NaN NaN NaN 31
2 1992-03-01 8475 7524.0 6938.0 NaN NaN NaN 60
3 1992-04-01 9401 8475.0 7524.0 6938.0 NaN NaN 91
4 1992-05-01 9558 9401.0 8475.0 7524.0 6938.0 NaN 121
5 1992-06-01 9182 9558.0 9401.0 8475.0 7524.0 6938.0 152
6 1992-07-01 9103 9182.0 9558.0 9401.0 8475.0 7524.0 182
7 1992-08-01 10513 9103.0 9182.0 9558.0 9401.0 8475.0 213
8 1992-09-01 9573 10513.0 9103.0 9182.0 9558.0 9401.0 244
9 1992-10-01 10254 9573.0 10513.0 9103.0 9182.0 9558.0 274
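A sketch of computing such a feature (assuming a DataFrame `sales_df` with a datetime column `date`, as in the table above):

# Days elapsed since the first date in the series
sales_df["Days_since"] = (sales_df["date"] - sales_df["date"].min()).dt.days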

Class demo